Implement ORB AWS EC2 Worker Adapter #525
Merged
sharpener6 merged 88 commits into finos:main on Mar 27, 2026
Conversation
- Include submit_tasks.py in examples readme and documentation.
- Implement skip_examples.txt for top-level examples in CI.
- Add submit_tasks.py to skip_examples.txt as it requires a running scheduler.
gxuu reviewed on Feb 10, 2026
src/scaler/worker_manager_adapter/orb/config/default_config.json
gxuu reviewed on Feb 10, 2026
- Use main's new worker_managers/ docs structure (PR finos#611)
- Move worker_manager_adapter/orb.rst to worker_managers/orb.rst
- Add ORB entry to worker_managers/index.rst
- Accept deletion of reorganized files (examples.rst, worker_manager_adapter/index.rst, common_parameters.rst)
Resolved conflicts:
- pyproject.toml: keep unified scaler_worker_manager + scaler entry points from main, retain scaler_worker_manager_orb from the orb branch
- README.md: use main's unified CLI command naming in the TOML section table, add an orb_worker_adapter row
- tests/config/test_config_class.py: use the bytes literal (b""") from main for mock_open read_data
- docs/source/tutorials/configuration.rst: accept deletion from main
- Register ORBWorkerAdapterConfig with _tag = "orb" for discriminator-based TOML parsing in the scaler all-in-one launcher
- Add an orb subcommand to the scaler_worker_manager dispatcher
- Add ORBWorkerAdapterConfig to WorkerManagerUnion in scaler.py
- Remove redundant top-level event_loop and worker_io_threads fields from ORBWorkerAdapterConfig in favour of the existing worker_config equivalents
- Update docs (commands.rst, orb.rst) and README to reflect the unified entry point
- Add tests for orb subcommand parsing, TOML config, and _run_worker_manager dispatch
The orb worker manager is now accessible via the unified `scaler_worker_manager orb` subcommand, making the dedicated entry point redundant.
When ORBClient is initialised with app_config=, its _ensure_raw_config()
merges only default_config.json (which has provider_defaults: {}) with the
caller-supplied dict, skipping the _load_strategy_defaults() call that
normally loads aws_defaults.json. As a result get_effective_handlers()
returns {} and RunInstances is absent from supported_apis, causing:
ApplicationError: Provider does not support API 'RunInstances'. Supported APIs: []
Fix by including provider_defaults.aws.handlers explicitly in
_build_app_config() so the RunInstances handler definition is always
present regardless of how ORB loads its config.
When the ORB adapter is at capacity it returns TooManyWorkers, but the scheduler's worker count (based on received heartbeats) may still be below max_task_concurrency because newly-created instances haven't sent their first heartbeat yet. This caused the scheduler to re-request a worker on every heartbeat, spamming the log. Fix: track sources that have returned TooManyWorkers and suppress new StartWorkers requests for that source until the scheduler's own worker count drops below max_task_concurrency (indicating a worker left and the ORB adapter has freed up capacity). Also fix a latent bug in all three scaling policies where the capacity check `len(managed) >= max_task_concurrency` is always True when max_task_concurrency == -1 (unlimited), blocking all scaling.
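The latent capacity-check bug is easy to see in isolation: any count is `>= -1`, so the sentinel for "unlimited" made the policy believe it was always at capacity. A minimal sketch of the corrected check (function name is illustrative, not taken from the PR):

```python
def at_capacity(managed_count: int, max_task_concurrency: int) -> bool:
    """Capacity check as used by the scaling policies (sketch).

    max_task_concurrency == -1 means unlimited. The naive comparison
    `managed_count >= max_task_concurrency` is True for every count when
    the limit is -1, which blocked all scaling; guard the sentinel first.
    """
    if max_task_concurrency == -1:  # unlimited: never at capacity
        return False
    return managed_count >= max_task_concurrency
```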
The module-level `from orb import ORBClient as orb` caused CI tests to fail when patching ORBWorkerAdapter, because importing the module triggered the import of `orb` which is not installed in CI. Moving the import inside `_run()` defers it until the adapter is actually used.
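The deferred-import pattern described above looks roughly like this (a sketch with all other adapter logic omitted):

```python
class ORBAWSEC2WorkerAdapter:
    """Sketch of the deferred optional-dependency import."""

    async def _run(self) -> None:
        # Imported here rather than at module level, so that merely
        # importing this module (e.g. so CI can patch the adapter class)
        # does not require the optional `orb` package to be installed.
        from orb import ORBClient  # deferred until the adapter actually runs

        ...  # use ORBClient to build config, launch instances, poll, etc.
```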
Replace the deprecated scaler_cluster command with scaler_worker_manager baremetal_native, passing --mode fixed and --worker-manager-id sourced from ec2-metadata.
Fix incorrect version.txt path in build.sh (was two levels up, should be three), and add the newly built AMI ami-0b76605999d8f5d2b for scaler 1.26.4 / Python 3.13 to the ORB docs table.
DEFAULT_MAX_TASK_CONCURRENCY was cpu_count() - 1, which evaluates to 0 on single-core machines. Remove the subtraction so at least one worker is started by default.
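As a small worked example of the default (a sketch, assuming `os.cpu_count()` as the cpu source; the real constant lives in the scaler codebase):

```python
import os

# Before the fix: os.cpu_count() - 1, which is 0 on a single-core machine,
# so no workers were started by default.
# After: drop the subtraction; `or 1` covers cpu_count() returning None.
DEFAULT_MAX_TASK_CONCURRENCY = os.cpu_count() or 1
```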
The _at_capacity_sources clearing condition was inverted: it cleared suppression when managed_worker_ids < max_task_concurrency, which is exactly the case during EC2 boot (0 workers, instance not yet registered). This caused the scheduler to resume spamming StartWorkers on the very next heartbeat after receiving TooManyWorkers. Replace the Set-based approach with a baseline Dict that records the managed worker count at the time TooManyWorkers was received. Suppression is now held until the scheduler's view of workers grows beyond that baseline, i.e. at least one booting instance has sent its first heartbeat.
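The baseline-Dict idea above can be sketched as a small helper (class and method names are illustrative, not the PR's identifiers):

```python
from typing import Dict


class StartWorkerSuppressor:
    """Suppress StartWorkers per source after a TooManyWorkers reply (sketch).

    Instead of a Set cleared whenever the worker count is below
    max_task_concurrency (true for the entire EC2 boot window), record the
    managed worker count seen when TooManyWorkers arrived and hold
    suppression until the count grows past that baseline, i.e. at least
    one booting instance has sent its first heartbeat.
    """

    def __init__(self) -> None:
        self._baselines: Dict[bytes, int] = {}

    def on_too_many_workers(self, source: bytes, managed_count: int) -> None:
        # Remember how many workers the scheduler saw at rejection time.
        self._baselines[source] = managed_count

    def should_suppress(self, source: bytes, managed_count: int) -> bool:
        baseline = self._baselines.get(source)
        if baseline is None:
            return False
        if managed_count > baseline:
            # A new worker registered, so the adapter has freed capacity.
            del self._baselines[source]
            return False
        return True
```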
sharpener6
reviewed
Mar 27, 2026
# Conflicts:
#	src/scaler/scheduler/controllers/policies/simple_policy/scaling/fixed_elastic.py
Renames all identifiers, file names, directories, config tags, CLI subcommands, docs, README, and tests from `orb` / `ORBWorkerAdapter` to `orb_aws_ec2` / `ORBAWSEC2WorkerAdapter` to make clear this adapter is specifically for AWS EC2 via the ORB SDK.
sharpener6 approved these changes on Mar 27, 2026
This pull request implements the ORB AWS EC2 Worker Adapter (`orb_aws_ec2`), enabling Scaler to dynamically scale worker instances on AWS using the ORB Python SDK (replacing the original CLI-based approach).

Key Changes
- Replaced `ORBHelper` (subprocess wrapper around the `orb` CLI) with direct `ORBClient` SDK usage. Config is built in-memory from `ORBAWSEC2WorkerAdapterConfig` fields — no file copying, no temp dirs for templates or user data.
- `orb_config_path`: no longer needed; the provider config is constructed programmatically via `_build_app_config()`.
- `_poll_for_instance_id()` is now fully async (`asyncio.sleep` instead of `time.sleep` + `run_in_executor`).
- New `ami/` directory with Packer configuration (`opengris-scaler.pkr.hcl`) and a build script (`build.sh`) to create AMIs pre-configured with `opengris-scaler`.
- `ORBAWSEC2WorkerAdapterConfig` for detailed adapter settings. Removed redundant top-level `event_loop` and `worker_io_threads` fields — these are now inherited from the standard `worker_config`.
- Instances run `cpu_count - 1` workers, where `cpu_count` is determined by the machine type configured by the user.
- Integrated into `scaler_worker_manager` (as the `orb_aws_ec2` subcommand) and the `scaler` all-in-one launcher (`type = "orb_aws_ec2"` in `[[worker_manager]]`). The dedicated `scaler_worker_manager_orb` entry point has been removed.
- `WorkerAdapterController` now waits for a pending command to complete before sending a new one, preventing duplicate `StartWorkerGroup` commands during the long ORB polling period.

Bug Fixes
- The scheduler re-requested `StartWorkers` immediately after receiving `TooManyWorkers`. Replaced the `Set`-based approach with a baseline `Dict` that records the managed worker count at the time `TooManyWorkers` was received — suppression is now held until at least one booting instance registers.
- `DEFAULT_MAX_TASK_CONCURRENCY` was `cpu_count() - 1`, which evaluates to `0` on single-core machines. Removed the subtraction so at least one worker is started by default.
- When `ORBClient` is initialised with `app_config=`, it skips `_load_strategy_defaults()`, causing `RunInstances` to be absent from supported APIs. Fixed by including `provider_defaults.aws.handlers` explicitly in `_build_app_config()`.
- `orb`: moved the `from orb import ORBClient` import inside `_run()` to defer it until the adapter is actually used, fixing CI test failures when `orb` is not installed.
- The AMI now launches workers via `scaler_worker_manager baremetal_native --mode fixed`, replacing the deprecated `scaler_cluster` command.

Dependencies
Added `orb-py` and `boto3` to the `orb_aws_ec2` and `all` extra dependency groups in `pyproject.toml`.

Usage
Workers launched by the ORB AWS EC2 adapter are EC2 instances that connect back to the scheduler over the network. The scheduler address, object storage address, and object storage server must all be externally reachable from those instances (e.g. a private VPC IP or public IP).
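As a hedged illustration, a `[[worker_manager]]` TOML entry might look like the following — `type = "orb_aws_ec2"` is confirmed by this PR, while the address field names and values are illustrative assumptions:

```toml
# Hypothetical entry for the scaler all-in-one launcher's TOML config.
# Addresses must be reachable from the launched EC2 instances
# (e.g. a private VPC IP or public IP), per the note above.
[[worker_manager]]
type = "orb_aws_ec2"
scheduler_address = "tcp://10.0.1.5:2345"       # illustrative field name
object_storage_address = "tcp://10.0.1.5:2346"  # illustrative field name
```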